This paper presents an integrated tool for German morphology and statistical
part-of-speech tagging which aims at making some well-established methods
widely available. The software is very user-friendly, runs on any PC and can
be downloaded as a complete package (including lexicon and documentation)
from the World Wide Web. The results of the tagger are comparable to those
of other tagging systems.

An integrated software package is presented which contains a morphology
module and a tagging module for German. The freely available software is
particularly user-friendly and can be obtained via the World Wide Web. The
quality of the results corresponds to the current state of research.



1 Introduction 

Morphology systems, lemmatisers and part-of-speech taggers are some of the
basic tools in natural language processing. There are numerous applications,
including syntax parsing, machine translation, automatic indexing and
semantic clustering of words. Unfortunately, for languages other than
English, such tools are rarely available, and different research groups are
often forced to redevelop them over and over again. For German, quite a few
morphology systems (Hausser 1996) and taggers (see table 1) have been
developed, which are described in Wothke et al. (1993) (IBM Heidelberg),
Steiner (1995) (University of Münster), Feldweg (1995) (University of
Tübingen), Schmid (1995) (University of Stuttgart), Armstrong et al. (1995)
(ISSCO Geneva), and Lezius (1995) (University of Paderborn). However, in most
cases, the tagger is isolated from the morphology system. It relies on a
lexicon of full forms which, of course, may be generated by a morphological
tool. Unfortunately, most German lexicons are not available due to copyright
restrictions and - as far as we know - none of them is in the public domain.
Therefore we have decided to make our system Morphy publicly available. It
combines a morphological and a tagging module in a single package and can be
downloaded from the World Wide Web.[1]

* In: D. Gibbon, ed., Natural Language Processing and Speech Technology.
Results of the 3rd KONVENS Conference, Bielefeld, October 1996. Mouton de
Gruyter, Berlin, 1996.
1 URL: http://www-psycho.uni-paderborn.de/lezius/morpho.html



Table 1: Comparison of German Taggers

Since Morphy has been created not only for linguists, but also for second
language learners, it has been designed for standard PC platforms and great
effort has been put into making it as easy to use as possible.

2 The morphology module of Morphy 

The morphology system is based on the Duden grammar (Drosdowski 1984). It
consists of three parts: analysis, generation and the lexical system.

The lexical system is more sophisticated than that of other systems in order
to allow a user-friendly extension of the lexicon. When entering a new word,
the user is asked the minimal number of questions which are necessary to
infer the new word's grammatical features and which any native speaker should
be able to answer. In most cases only the root of the word has to be typed
in; questions are answered by pressing the number of the correct alternative
(see figure 1 for the dialogue when entering the verb telefonieren).
Currently, the lexicon comprises 21,500 words (about 100,000 word forms) and
is extended continuously.

Starting from the root of a word and the inflexion type as stored in the
lexicon, the generation system produces all inflected forms, which are shown
on the screen. Among other morphological features it considers vowel
mutation, shifts between ß and ss, as well as pre- and infixation of markers
for participles.

  2. Wird das Verb schwach konjugiert? (Is the verb conjugated weakly?)
       1: Ja
       2: Nein
  3. Wie lautet die 2. Person Singular Präsens? (What is the 2nd person
     singular present tense?)
       1: du telefonierst
       2: du telefonierest
       3: du telefoniert
  4. Wie lautet das Partizip des Verbs? (What is the participle of the verb?)
       1: telefoniert
       2: getelefoniert
  Verb klassifiziert! (Verb classified!)

Figure 1: Dialogue when entering telefonieren (user input is printed in bold
type)

The analysis system determines for each word form its root and its part of
speech and, if appropriate, its gender, case, number, tense and comparative
degree. It also segments compound nouns using a longest-matching rule which
works from right to left. Since the system treats each word form separately,
ambiguities cannot be resolved. For ambiguous word forms any possible lemma
and morphological description is given (for some examples see table 2). If a
word form cannot be recognised, its part of speech is predicted by an
algorithm which makes use of statistical data on German suffix frequencies.
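
A suffix-based prediction of this kind could be sketched as follows. This is
only an illustration: the toy training pairs, the maximum suffix length of
four characters, the fallback tag and the function names train_suffix_stats
and guess_pos are assumptions and are not taken from Morphy itself.

```python
from collections import Counter, defaultdict

def train_suffix_stats(tagged_words, max_len=4):
    """Count how often each word-final suffix (1..max_len chars) occurs with each tag."""
    stats = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for n in range(1, min(max_len, len(w)) + 1):
            stats[w[-n:]][tag] += 1
    return stats

def guess_pos(word, stats, max_len=4, default="SUB"):
    """Return the most frequent tag of the longest suffix seen in training."""
    w = word.lower()
    for n in range(min(max_len, len(w)), 0, -1):
        counts = stats.get(w[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return default  # assumed fallback when no suffix is known

# Toy training data, invented for illustration only.
TOY = [("Freiheit", "SUB"), ("Schönheit", "SUB"), ("spielen", "VER"),
       ("laufen", "VER"), ("freundlich", "ADJ"), ("herrlich", "ADJ")]
STATS = train_suffix_stats(TOY)
```

With these toy counts, an unseen form such as Klugheit is assigned the tag of
the suffix -heit, and a form whose suffixes were never observed falls back to
the default tag.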

Morphy's lookup mechanism when analyzing texts is not based on a lexicon of
full forms. Instead, there is only a lexicon of roots together with their
inflexion types. When analyzing a word form, Morphy cuts off all possible
suffixes, builds the possible roots, looks up these roots in the lexicon, and
for each root generates all possible inflected forms. Only those roots which
lead to inflected forms identical to the original word form are selected
(for details see Lezius 1994).
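
The lookup procedure just described can be illustrated by the following
minimal sketch. The two-entry lexicon, the inflexion types weak_verb and
noun_er and their ending lists are invented for illustration; the sketch
ignores vowel mutation and prefixation and is not Morphy's actual
implementation (for which see Lezius 1994).

```python
# Hypothetical miniature lexicon: root -> inflexion type.
LEXICON = {"spiel": "weak_verb", "kind": "noun_er"}

# Hypothetical, heavily simplified ending lists per inflexion type.
ENDINGS = {
    "weak_verb": ["e", "st", "t", "en", "te", "test", "ten", "tet"],
    "noun_er": ["", "es", "e", "er", "ern"],
}

def generate(root, inflexion_type):
    """Generate all inflected forms of a root (generation module)."""
    return {root + ending for ending in ENDINGS[inflexion_type]}

def analyse(word_form):
    """Return all (root, inflexion type) pairs that regenerate the word form."""
    w = word_form.lower()
    roots = []
    for cut in range(len(w), 0, -1):      # cut off 0, 1, 2, ... final characters
        root = w[:cut]
        itype = LEXICON.get(root)         # look the candidate root up
        if itype is not None and w in generate(root, itype):
            roots.append((root, itype))   # keep only roots that reproduce the form
    return roots
```

For example, analyse("spieltest") yields the root spiel, whereas the
non-word "kindst" is rejected because no generated form of kind matches it.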

Naturally, this procedure is much slower than a simple lookup in a full-form
lexicon.[2] Nevertheless, there are advantages: First, the lexicon can be
kept small,[3] which is an important consideration for a PC-based system
intended to be widely distributed. Secondly, the processing of German
compound nouns fits in with this concept.
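
How a root lexicon supports compound segmentation with a longest-matching
rule working from right to left might be sketched as follows. The root set,
the list of linking elements and the function name segment are assumptions
made for illustration; inflected compounds and spelling contractions such as
Schiffahrt are not handled here.

```python
# Hypothetical root lexicon and linking elements (Fugenelemente).
ROOTS = {"bauer", "haus", "hafen", "meister"}
LINKERS = ("", "s", "n", "es", "en")

def segment(word, roots=ROOTS, linkers=LINKERS):
    """Split a compound into known roots, or return None if no split is found."""
    w = word.lower()
    # Trying start = 0, 1, 2, ... tests the longest right-hand part first.
    for start in range(len(w)):
        tail = w[start:]
        if tail not in roots:
            continue
        rest = w[:start]
        if rest == "":
            return [tail]
        # Optionally strip a linking element from the end of the remainder.
        for link in linkers:
            if link and not rest.endswith(link):
                continue
            head = rest[:len(rest) - len(link)]
            left = segment(head, roots, linkers) if head else None
            if left:
                return left + [tail]
    return None
```

Here segment("Bauernhaus") first finds haus at the right edge, strips the
linking element n from the remainder and then recognises bauer.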

The performance of the morphology system was tested at the Morpholympics
conference 1994 in Erlangen (see Hausser (1996), pp. 13-14, and Lezius
(1996)) with a specially designed test corpus which had been unknown to the
participants. This corpus comprised about 7,800 word forms and consisted of
different text types (two political speeches, a fragment of the Limas corpus
and a list of special word forms). Morphy recognised 89.2%, 95.9%, 86.9% and
75.8% of the word forms, respectively.

2 Morphy's current analysis speed is about 50 word forms per second on a fast PC, which 
is sufficient for many purposes. For the processing of larger corpora we have used Morphy to 
generate a full-form lexicon under UNIX. This has led to an analysis speed of many thousand 
word forms per second. 

3 Only 750 KB of memory is necessary for the current lexicon.
Table 2: Some examples of the morphological analysis

word form                 morphological description               root
------------------------  --------------------------------------  --------------------------------
Flüssen                   Substantiv Dativ Plural maskulinum      Fluß
Bauernhäusern             Substantiv Dativ Plural neutrum         Bauer / Haus
Schiffahrtshafenmeisters  Substantiv Genitiv Singular maskulinum  Schiff / Fahrt / Hafen / Meister
Küsse                     Substantiv Nominativ Plural maskulinum  Kuß
                          Substantiv Genitiv Plural maskulinum    Kuß
                          Substantiv Akkusativ Plural maskulinum  Kuß
                          Verb 1. Person Singular Präsens         küssen
                          Verb 1. Person Singular Konjunktiv 1    küssen
                          Verb 3. Person Singular Konjunktiv 1    küssen
                          Verb Imperativ Singular                 küssen
einnahm                   Verb 1. Person Singular Präteritum      (ein)nehmen
                          Verb 3. Person Singular Präteritum      (ein)nehmen
verspieltest              Verb 2. Person Singular Präteritum      ver-spielen
                          Verb 2. Person Singular Konjunktiv 2    ver-spielen
verspieltes               Adjektiv Nominativ Singular neutrum     verspielt (ver-spielen)
                          Adjektiv Akkusativ Singular neutrum     verspielt (ver-spielen)
edlem                     Adjektiv Dativ Singular neutrum         edel
                          Adjektiv Dativ Singular maskulinum      edel


3 The tagging module of Morphy 

Since morphological analysis operates on isolated word forms, ambiguities are 
not resolved. The task of the tagger is to resolve these ambiguities by taking 
into account contextual information. When designing a tagger, a number of 
decisions have to be made: 

• Selection of a tag set. 

• Selection of a tagging algorithm. 

• Selection of a training and test corpus. 

3.1 Tag Set 

Like the morphology system, the tagger is based on the classification of the
parts of speech from the Duden grammar. Supplementary additions have been
taken from the system of Bergenholtz and Schaeder (1977). The resulting tag
set includes grammatical features such as gender, case and number. This
results in a very complex system, comprising about 1000 different tags (see
Lezius 1995). Since only 456 of these tags actually occurred in the training
corpus, the tag set was reduced to about half its size. However, most German
word forms are highly ambiguous in this system (about 5 tags per word form on
average).
Although the amount of information gained by this system is very high,
tagging algorithms with such large tag sets have led to poor results in the
past (see Wothke et al. 1993; Steiner 1995). This is because different
grammatical features often have the same surface realization (e.g.
nominative and accusative nouns are difficult for the tagger to distinguish).
By grouping together parts of speech with different grammatical features
this kind of error can be significantly reduced. This is what current small
tag sets implicitly do. However, one has to keep in mind that the amount of
information provided by the tagger is also reduced with a smaller tag set.

Since some applications do not require detailed distinctions, we also
constructed a small tag set comprising 51 tags as shown in table 3. Both tag
sets are constructed in such a way that the large tag set can be directly
mapped onto the small tag set.
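
One simple way to realise such a mapping is to drop the markers for case,
number, gender and person from a large tag. The sketch below does exactly
this and nothing more; the feature list is an assumption based on the markers
visible in the tagging example (figure 4) plus GEN and 2PE, and the actual
mapping used in Morphy may contain further rules (for instance for participles
or personal pronouns).

```python
# Assumed feature markers: case, number, gender and person.
FEATURES = {
    "NOM", "GEN", "DAT", "AKK",   # case
    "SIN", "PLU",                 # number
    "MAS", "FEM", "NEU",          # gender
    "1PE", "2PE", "3PE",          # person
}

def to_small_tag(large_tag):
    """Map a large tag onto a coarser tag by dropping the feature markers."""
    return " ".join(part for part in large_tag.split() if part not in FEATURES)
```

Applied to the large tag POS AKK SIN FEM ATT, this function returns POS ATT.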

3.2 Tagging algorithm 

The tagger uses the trigram algorithm of Church (1988), which is still
unsurpassed in terms of simplicity, robustness and accuracy. However, since
we assumed that longer n-grams may give more information, and since we
observed that some longer n-grams are rather frequent in corpora (see figure
2 for some statistics on the Brown corpus), we decided to compare the Church
algorithm with a tagging algorithm relying on variable context widths as
described by Rapp (1995).
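
The general idea of trigram tagging can be sketched as follows: each tag
sequence is scored by the product of lexical probabilities P(tag | word) and
contextual probabilities P(tag | two preceding tags), and the best sequence
is found by dynamic programming. This is a sketch in the spirit of Church
(1988), not a reproduction of it; the add-alpha smoothing, the uniform
treatment of unknown words, the left-to-right direction and the toy training
sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Collect lexical counts and tag trigram counts from tagged sentences."""
    lex = defaultdict(Counter)   # word -> tag counts
    tri = Counter()              # (t1, t2, t3) counts
    bi = Counter()               # (t1, t2) context counts
    tags = set()
    for sent in tagged_sentences:
        seq = ["<s>", "<s>"] + [t for _, t in sent]
        for w, t in sent:
            lex[w.lower()][t] += 1
            tags.add(t)
        for i in range(2, len(seq)):
            tri[tuple(seq[i - 2:i + 1])] += 1
            bi[tuple(seq[i - 2:i])] += 1
    return lex, tri, bi, tags

def tag(words, model, alpha=0.1):
    """Return the most probable tag sequence under the trigram model."""
    lex, tri, bi, tags = model
    n_tags = len(tags)

    def p_ctx(t1, t2, t3):       # add-alpha smoothed P(t3 | t1, t2)
        return (tri[(t1, t2, t3)] + alpha) / (bi[(t1, t2)] + alpha * n_tags)

    def candidates(word):        # (tag, P(tag | word)); unknown words get all tags
        counts = lex.get(word.lower())
        if not counts:
            return [(t, 1.0 / n_tags) for t in sorted(tags)]
        total = sum(counts.values())
        return [(t, c / total) for t, c in counts.items()]

    # best[(previous tag, current tag)] = (log score, tag sequence so far)
    best = {("<s>", "<s>"): (0.0, [])}
    for word in words:
        new = {}
        for (t1, t2), (score, path) in best.items():
            for t3, p_lex in candidates(word):
                s = score + math.log(p_lex) + math.log(p_ctx(t1, t2, t3))
                if (t2, t3) not in new or s > new[(t2, t3)][0]:
                    new[(t2, t3)] = (s, path + [t3])
        best = new
    return max(best.values(), key=lambda entry: entry[0])[1]

# Toy training corpus, invented for illustration only.
TRAIN = [
    [("Ich", "PRO PER"), ("meine", "VER"), ("meine", "PRO POS ATT"),
     ("Frau", "SUB"), (".", "SZE")],
    [("Ich", "PRO PER"), ("sehe", "VER"), ("meine", "PRO POS ATT"),
     ("Frau", "SUB"), (".", "SZE")],
    [("Ich", "PRO PER"), ("meine", "VER"), ("das", "PRO DEM PRO"), (".", "SZE")],
]
MODEL = train(TRAIN)
```

On this toy model the two occurrences of meine in Ich meine meine Frau. are
disambiguated by their context, since the trigram (PRO PER, VER, PRO POS ATT)
is attested in the training data.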

Starting from an ambiguous word form which is to be tagged, this algorithm
considers the preceding word forms - which have already been tagged - and
the succeeding word forms still to be tagged. For this ambiguous word form
the algorithm constructs all possible tag sequences composed of the already
computed tags on the left, one of the possible tags of the critical word form
and possible tags on the right.

The choice of the tag for the critical word form is a function of the length
of the tag sequences to the left and to the right which can be found in the
training corpus. A detailed description of this algorithm is given in Rapp
(1995, pp. 149-154).
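
The following sketch illustrates the principle only. It assumes that a
candidate tag is scored by the sum of the longest left context and the
longest right context attested in the training corpus; this scoring function,
the context limit of four tags, the shortened tag names and all function
names are assumptions made for illustration, and the function actually used
is the one defined by Rapp (1995).

```python
from itertools import product

def ngram_set(tag_sequences, max_n=5):
    """All tag n-grams (as tuples) up to length max_n seen in the training corpus."""
    seen = set()
    for seq in tag_sequences:
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
    return seen

def left_length(left_tags, cand, seen, max_ctx=4):
    """Longest k such that the last k left tags followed by cand occur in training."""
    best = 0
    for k in range(1, min(max_ctx, len(left_tags)) + 1):
        if tuple(left_tags[-k:]) + (cand,) in seen:
            best = k
        else:
            break
    return best

def right_length(cand, right_options, seen, max_ctx=4):
    """Longest k such that cand plus some tag choice for the next k words occurs."""
    best = 0
    for k in range(1, min(max_ctx, len(right_options)) + 1):
        if any((cand,) + combo in seen for combo in product(*right_options[:k])):
            best = k
        else:
            break
    return best

def choose_tag(left_tags, candidates, right_options, seen):
    """Pick the candidate with the longest attested context (left plus right)."""
    return max(candidates,
               key=lambda c: left_length(left_tags, c, seen)
               + right_length(c, right_options, seen))

# Toy training tag sequences, invented for illustration only.
TRAIN_TAGS = [["PER", "VER", "POS", "SUB", "SZE"], ["PER", "VER", "DEM", "SZE"]]
SEEN = ngram_set(TRAIN_TAGS)
```

For the second meine in Ich meine meine Frau., the candidate POS is supported
by two tags of context on either side, whereas VER is supported by none, so
POS is chosen.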

Although some authors (Cutting et al. 1992; Schmid 1995; Feldweg 1995) claim
that unsupervised tagging algorithms produce superior results, we chose
supervised learning. These publications pay little attention to the fact that
algorithms for unsupervised tagging require great care (or even luck) when
tuning some initial parameters. It frequently happens that unsupervised
learning with sophisticated tag sets ends up in local minima, which can lead
to poor results without any indication to the user. Such behavior seemed
unacceptable for a standard tool.
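Table 3: The small tag set (51 tags)

tag name      explanation of the tag                                                  example
------------  ----------------------------------------------------------------------  --------------------
SUB           Substantiv (noun)                                                       (der) Mann
EIG           Eigenname (proper name)                                                 Egon, (Herr) Hansen
VER           finite Verbform (finite verb form)                                      spielst, läuft
VER INF       Infinitiv (infinitive)                                                  spielen, laufen
VER PA2       Partizip Perfekt (past participle)                                      gespielt, gelaufen
VER EIZ       erweiterter Infinitiv mit zu (extended infinitive with zu)              abzuspielen
VER IMP       Imperativ (imperative)                                                  lauf, laufe
VER AUX       finite Hilfsverbform (finite auxiliary verb form)                       bin, hast
VER AUX INF   Infinitiv (infinitive)                                                  haben, sein
VER AUX PA2   Partizip Perfekt (past participle)                                      gehabt, gewesen
VER AUX IMP   Imperativ (imperative)                                                  sei, habe
VER MOD       finite Modalverbform (finite modal verb form)                           kannst, will
VER MOD INF   Infinitiv (infinitive)                                                  können, wollen
VER MOD PA2   Partizip Perfekt (past participle)                                      gekonnt, gewollt
VER MOD IMP   Imperativ (imperative)                                                  könne
ART IND       unbestimmter Artikel (indefinite article)                               ein, eines
ART DEF       bestimmter Artikel (definite article)                                   der, des
ADJ           Adjektivform (adjective form)                                           schnelle, kleinstes
ADJ ADV       Adjektiv, adverbiell (adjective, adverbial)                             (Er läuft) schnell.
PRO DEM ATT   Demonstrativpronomen, attributiv (demonstrative pronoun, attributive)   diese (Frau)
PRO DEM PRO   Demonstrativpronomen, pronominal (demonstrative pronoun, pronominal)    diese
PRO REL ATT   Relativpronomen, attributiv (relative pronoun, attributive)             , dessen (Frau)
PRO REL PRO   Relativpronomen, pronominal (relative pronoun, pronominal)              , welcher
PRO POS ATT   Possessivpronomen, attributiv (possessive pronoun, attributive)         mein (Buch)
PRO POS PRO   Possessivpronomen, pronominal (possessive pronoun, pronominal)          (Das ist) meiner.
PRO IND ATT   Indefinitpronomen, attributiv (indefinite pronoun, attributive)         alle (Menschen)
PRO IND PRO   Indefinitpronomen, pronominal (indefinite pronoun, pronominal)          (Ich mag) alle.
PRO INR ATT   Interrogativpronomen, attributiv (interrogative pronoun, attributive)   Welcher (Mann)?
PRO INR PRO   Interrogativpronomen, pronominal (interrogative pronoun, pronominal)    Wer?
PRO PER       Personalpronomen (personal pronoun)                                     er, wir
PRO REF       Reflexivpronomen (reflexive pronoun)                                    sich, uns
ADV           Adverb (adverb)                                                         schon, manchmal
ADV PRO       Pronominaladverb (pronominal adverb)                                    damit, dadurch
KON UNT       unterordnende Konjunktion (subordinating conjunction)                   daß, da
KON NEB       nebenordnende Konjunktion (coordinating conjunction)                    und, oder
KON INF       Infinitivkonjunktion (infinitive conjunction)                           um (zu spielen)
KON VGL       Vergleichskonjunktion (comparative conjunction)                         als, denn, wie
KON PRI       Proportionalkonjunktion (proportional conjunction)                      desto, um so, je
PRP           Präposition (preposition)                                               durch, an
SKZ           Sonderklasse für zu (special class for zu)                              (, um) zu (spielen)
ZUS           Verbzusatz (separable verb particle)                                    (spielst) ab
INJ           Interjektion (interjection)                                             Wau, Oh
ZAL           Zahlwörter (number words)                                               eins, tausend
ZAN           Zahlen (numbers)                                                        100, 2
ABK           Abkürzung (abbreviation)                                                Dr., usw.
SZD           Doppelpunkt (colon)                                                     :
SZE           Satzendezeichen (sentence-final punctuation)                            . ! ?
SZG           Gedankenstrich (dash)                                                   -
SZK           Komma (comma)                                                           ,
SZS           Semikolon (semicolon)                                                   ;
SZN           sonstige Satzzeichen (other punctuation marks)                          ( ) /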

Figure 2: Statistics on the Brown corpus: number of different n-grams
occurring in the corpus versus size of the corpus (all figures in thousands)

3.3 Training and test corpus 

For training and testing we took a fragment from the "Frankfurter Rundschau"
corpus,[4] which we have been collecting since 1992. Tables and other
non-textual items were removed manually. A segment of 20,000 word forms was
used for training, another segment of 5,000 word forms for testing. Any word
forms not recognised by the morphology system were added to the lexicon.
Using a special tagging editor which - on the basis of the morphology module
- offers a choice of possible tags for each word, both corpora were tagged
semi-automatically with the large tag set. A recent version of the editor
additionally predicts the correct tag.

4 Results 

Using the probabilities from the manually annotated training corpus, the test 
corpus was tagged automatically. The results were compared with the previous 
manual annotation of the test corpus. This was done for both tagging algorithms 
and tag sets. For the small tag set, the Church algorithm achieved an accuracy 
of 95.9%, whereas with the variable-context algorithm an accuracy of 95.0% was 
obtained. For the large tag set the respective figures are 84.7% and 81.8%. 
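
The accuracy figures are the proportion of word forms whose automatically
assigned tag agrees with the manual annotation. The short sketch below makes
this explicit and also converts an accuracy figure into an approximate number
of wrongly tagged word forms. The function names are illustrative, and the
corpus size of exactly 5,000 word forms is an assumption based on the size of
the test segment given in section 3.3.

```python
def accuracy(predicted, gold):
    """Proportion of positions at which the predicted tag equals the gold tag."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("tag lists must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def implied_errors(accuracy_percent, corpus_size=5000):
    """Approximate number of wrongly tagged word forms for a given accuracy."""
    return round(corpus_size * (100 - accuracy_percent) / 100)
```

Under this assumption, the difference between 95.9% and 95.0% for the small
tag set corresponds to about 205 versus 250 wrongly tagged word forms, and
the difference between 84.7% and 81.8% for the large tag set to about 765
versus 910.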

In comparison with other research groups, the results are similar for the
small tag set and slightly better for the large tag set (see table 1).
Surprisingly, in spite of considering less context, the Church algorithm
performs better than the variable-context algorithm in both cases.

4 This corpus was generously donated by the Druck- und Verlagshaus Frankfurt am Main 
and has been included in the CD-ROM of the European Corpus Initiative. We thank Gisela 
Zunker for her help with the acquisition and preparation of the corpus. 
Figure 3: Accuracy versus size of the training corpus for Church's trigram
algorithm and the variable-context algorithm and both tag sets.



This is the reason why the current implementation of Morphy only includes the
Church algorithm.[5] As an example, figure 4 gives the annotation results of
a few test sentences for both tag sets.

However, the variable-context algorithm also has some advantages. First, its
potential when using larger training corpora seems to be slightly higher (see
figure 3). Secondly, when the algorithm is modified in such a way that
sentence boundaries are not assumed to be known beforehand, the performance
degrades only minimally. This means that this algorithm can actually
contribute to finding sentence boundaries. Third, if there are sequences of
unknown word forms in the text, the algorithm makes better guesses than the
Church algorithm (examples are given in Rapp 1995, p. 155). When about 2% of
the word forms in the test corpus were randomly replaced by unknown word
forms, the quality of the results for the Church algorithm decreased by 0.7%
for the small and by 2.0% for the large tag set. The respective figures for
the variable-context algorithm are 0.9% and 1.3%, i.e. slightly worse for the
small tag set but clearly better for the large one.

In a further experiment, the contribution of the lexical probabilities to the
quality of the results was examined. Without the lexical probabilities, the
results decreased by 0.3% (small tag set) and 0.6% (large tag set) for the
Church algorithm; the respective figures for the variable-context algorithm
were 0.9% and 0.0%.

5 The speed of the tagger (including morphological analysis) is about 20 word forms per 
second for the large and 100 word forms per second for the small tag set on a fast PC. 
Small tag set:

  Die          ART DEF
  Frau         SUB
  bringt       VER
  das          ART DEF
  Essen        SUB
  .            SZE

  Ich          PER PRO
  meine        VER
  meine        POS ATT
  Frau         SUB
  .            SZE

  Winde        SUB
  das          ART DEF
  im           PRP
  Winde        SUB
  flatternde   ADJ
  Segel        SUB
  um           PRP
  die          ART
  Winde        SUB

Large tag set:

  Die          ART DEF NOM SIN FEM
  Frau         SUB NOM FEM SIN
  bringt       VER 3PE SIN
  das          ART DEF AKK SIN NEU
  Essen        SUB AKK NEU SIN
  .            SZE

  Ich          PER NOM SIN 1PE
  meine        VER 1PE SIN
  meine        POS AKK SIN FEM ATT
  Frau         SUB AKK FEM SIN
  .            SZE

  Winde        VER 3PE SIN
  das          DEM NOM SIN NEU PRO
  im           PRP DAT
  Winde        SUB DAT MAS SIN
  flatternde   PA1 SOL NEU AKK PLU
  Segel        SUB AKK NEU PLU
  um           PRP AKK
  die          ART DEF AKK SIN FEM
  Winde        SUB AKK FEM SIN
  .            SZE

Figure 4: Tagging example for both tag sets - the ambiguity rates amount to
2.4 tags per word for the small and 8.8 tags per word for the large tag set
(errors are printed in bold type).

5 Conclusions 

We have compared two different tagging algorithms and two different tag sets.
The first tagging algorithm is the Church algorithm, which uses trigrams to
compute contextual probabilities. The second algorithm, the so-called
variable-context algorithm, has been described in section 3.2. The smaller of
the two tag sets contains 51 parts of speech; the larger tag set includes
additional grammatical features such as case, number and gender. The small
tag set is a subset of the large tag set.

In comparison with the Church algorithm, the variable-context algorithm
produces similar results for the small tag set, but significantly inferior
results for the large tag set. On the other hand, the performance of the
variable-context algorithm for the large tag set improves faster with
increasing size of the training corpus than the performance of the Church
algorithm. Thus, if more training texts are tagged manually, similar results
are to be expected for the two algorithms.

Considering the two tag sets, the results for the small tag set are
significantly better. Nevertheless, with increasing size of the training
corpus the results for the two tag sets seem likely to converge.
One of our aims for the near future is to use the output of the tagger for
lemmatization. In this way a sentence like Ich meine meine Frau. could be
unambiguously reduced to ich / meinen / mein / Frau.
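
A minimal sketch of such a tag-driven lemmatization is given below. It
assumes a hypothetical table that maps pairs of word form and small tag to
lemmas; the table entries, the tag names taken from table 3 and the function
name lemmatise are illustrative only and do not describe an existing Morphy
component.

```python
# Hypothetical lookup table: (lower-cased word form, tag) -> lemma.
LEMMAS = {
    ("ich", "PRO PER"): "ich",
    ("meine", "VER"): "meinen",
    ("meine", "PRO POS ATT"): "mein",
    ("frau", "SUB"): "Frau",
}

def lemmatise(tagged_sentence, lemmas=LEMMAS):
    """Reduce a tagged sentence to its lemmas, using the tag to pick the reading."""
    result = []
    for word, tag in tagged_sentence:
        if tag.startswith("SZ"):          # skip punctuation tags (SZD, SZE, ...)
            continue
        # Fall back to the word form itself if the pair is not in the table.
        result.append(lemmas.get((word.lower(), tag), word))
    return result
```

Given the tagged sentence Ich/PRO PER meine/VER meine/PRO POS ATT Frau/SUB
./SZE, the function returns ich, meinen, mein, Frau, because the tag selects
the verbal or the possessive reading of meine.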